Add TableSink operator with Java/Spark implementations#665
harrygav wants to merge 1 commit into apache:main
Conversation
Thanks @harrygav, this is great! Could we make …
Thank you! Just to get the tests running, how about mocking the JDBC layer? Wrap …

I will take a look and update the PR to continue the discussion!
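One way the mocking idea could be shaped is to route connection creation through a small seam that tests can replace. This is only an illustrative sketch, not the PR's actual code: `ConnectionFactory` and the dynamic-proxy stand-in are hypothetical names.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;

public class MockJdbcSketch {

    // Hypothetical seam: the sink asks a factory for connections instead of
    // calling DriverManager directly, so tests can inject a mock.
    interface ConnectionFactory {
        Connection open() throws Exception;
    }

    static int opened = 0;

    public static void main(String[] args) throws Exception {
        ConnectionFactory mock = () -> {
            opened++;
            // A do-nothing Connection via a dynamic proxy, enough for wiring tests.
            return (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[]{Connection.class},
                (proxy, method, a) -> {
                    // Return benign defaults for primitive-returning methods.
                    Class<?> r = method.getReturnType();
                    if (r == boolean.class) return false;
                    if (r.isPrimitive() && r != void.class) return 0;
                    return null;
                });
        };
        try (Connection c = mock.open()) {
            System.out.println("opened=" + opened);
        }
    }
}
```

A sink written against such a seam can be unit-tested without any running database; the real factory would simply delegate to DriverManager.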
Hey @harrygav, any news on this? Apparently a table sink is crucial for many things, and we would like to start using it already.
Also, another question: wouldn't it make sense to also have a sink for a database? Right now you have implemented one execution operator for Java and one for Spark, but why not one for a database?
Hi @zkaoudi, nice to hear that the PR will be useful; I will follow up on this by the end of the week! Thanks for your input. I think there are many things to be clarified for the sink operator, but I guess we will figure them out once we know more about the targeted use cases we want to cover. With the current implementation, you could do the ETL pipeline you mention through the Java or Spark platforms, e.g., Source(Java/Spark from DBMS1)->ETL(Java/Spark)->Sink(Java/Spark to DBMS2). Or were you thinking of writing from DBMS1 into DBMS2 directly, without any intermediate Java/Spark platform step? That would also be interesting for some use cases (improved performance) but becomes cumbersome to maintain in terms of interoperability. Let me know what you think!
Yes, I was thinking about directly writing from DBMS1 to DBMS2. For example, if you have two tables in two DBMSs and you want to join them and write the result into DBMS2 without doing the join in Spark or Java. What do you mean by issues of interoperability?
But on second thought, what I described above as a scenario is more like a conversion operator in addition to a sink. You would ideally want to create a temp table to do the join and then persist the result.
Hi all, picking up this one again. I just pushed a commit with the update:
I think it would be wise to add a couple of DBMSes for the tests, which would also be useful for the source operators or for supporting JDBC platforms themselves. This could be done either through their embedded versions or through Maven Testcontainers. Then, we could add support for sinks on other platforms, e.g., the JDBC platform, to also support DBMS->DBMS workloads. Let me know what you think about the PR, and whether we want to do some of the next steps (e.g., testing) here or in another PR.
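For reference, either option mentioned above is roughly a one-dependency change in the affected module's pom. A possible shape (versions are illustrative, not what the PR pins):

```xml
<!-- Embedded option: H2 in-memory database for fast unit tests -->
<dependency>
  <groupId>com.h2database</groupId>
  <artifactId>h2</artifactId>
  <version>2.2.224</version>
  <scope>test</scope>
</dependency>
<!-- Container option: a real PostgreSQL spun up per test run -->
<dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>postgresql</artifactId>
  <version>1.19.7</version>
  <scope>test</scope>
</dependency>
```

The embedded route keeps CI self-contained; Testcontainers exercises real driver/dialect behavior at the cost of requiring Docker on the build machine.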
.asf.yaml (outdated)

```diff
- description: Apache Wayang is the first cross-platform data processing system.
- homepage: https://wayang.apache.org/
+ description: Apache Wayang(incubating) is the first cross-platform data processing system.
+ homepage: https://wayang.incubator.apache.org/
```
Do not modify this file, it seems it is an old version.
Sorry, this file change mistakenly sneaked in after the rebase.
Thanks a lot for your contribution, Harry!
I can squash the commits and update the PR. Anything else to address?
All seems good to me. We can merge. Thank you @harrygav
@harrygav Just remember to clean up the rheem references and I'm also happy :)
Force-pushed from 7cd4822 to ccac0b2
Thanks @mspruc, I replaced the old rheem references with wayang and squashed the commits. Let me know if you would like any other changes!
@harrygav On my end you still have references to rheem in the Java and Spark table sinks.
…plementation and tests
Force-pushed from ccac0b2 to b2a59c4
Pull request overview
This PR introduces a new TableSink unary sink operator (in wayang-basic) and adds platform-specific implementations for Java and Spark to write Record/POJO data into JDBC tables, including basic schema inference via the new SqlTypeUtils.
Changes:
- Added `TableSink` operator plus `SqlTypeUtils` for JDBC dialect/type/schema inference.
- Added `JavaTableSink` implementation and accompanying H2-based unit tests.
- Added `SparkTableSink` implementation and accompanying H2-based unit tests; updated platform/module POM dependencies.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| wayang-platforms/wayang-spark/src/test/java/org/apache/wayang/spark/operators/SparkTableSinkTest.java | Adds Spark sink tests against H2 (but currently contains Checkstyle-breaking unused imports). |
| wayang-platforms/wayang-spark/src/main/java/org/apache/wayang/spark/operators/SparkTableSink.java | Implements Spark JDBC writing with Record schema derivation and POJO handling. |
| wayang-platforms/wayang-spark/pom.xml | Adds JDBC driver dependencies for Spark module (currently includes PostgreSQL in compile scope). |
| wayang-platforms/wayang-java/src/test/java/org/apache/wayang/java/operators/JavaTableSinkTest.java | Adds Java sink tests against H2. |
| wayang-platforms/wayang-java/src/main/java/org/apache/wayang/java/operators/JavaTableSink.java | Implements JDBC table creation and batch insertion for Java platform. |
| wayang-platforms/wayang-java/pom.xml | Adds test dependencies (H2 + PostgreSQL). |
| wayang-commons/wayang-basic/src/test/java/org/apache/wayang/basic/util/SqlTypeUtilsTest.java | Adds tests for dialect detection and schema/type mapping. |
| wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/util/SqlTypeUtils.java | Adds utilities for JDBC URL product detection + Java→SQL type/schema mapping. |
| wayang-commons/wayang-basic/src/main/java/org/apache/wayang/basic/operators/TableSink.java | Adds the new logical operator holding table name, mode, props, and optional column names. |
| wayang-commons/wayang-basic/pom.xml | Adds Calcite dependency to support SqlDialect.DatabaseProduct. |
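The JDBC-URL product detection that `SqlTypeUtils` provides can be illustrated with a standalone sketch. This is hypothetical code mirroring the described behavior, not the PR's actual implementation: a JDBC URL's second colon-separated token names the database product.

```java
import java.util.Locale;
import java.util.Map;

public class JdbcProductSketch {

    // Illustrative subset of products; the real utility maps to
    // Calcite's SqlDialect.DatabaseProduct.
    static final Map<String, String> KNOWN = Map.of(
        "postgresql", "POSTGRESQL",
        "h2", "H2",
        "mysql", "MYSQL");

    static String detect(String jdbcUrl) {
        // JDBC URLs look like jdbc:<subprotocol>:<subname>.
        String[] parts = jdbcUrl.split(":");
        if (parts.length < 2 || !"jdbc".equals(parts[0])) {
            throw new IllegalArgumentException("not a JDBC URL: " + jdbcUrl);
        }
        return KNOWN.getOrDefault(parts[1].toLowerCase(Locale.ROOT), "UNKNOWN");
    }

    public static void main(String[] args) {
        System.out.println(detect("jdbc:h2:mem:test"));
        System.out.println(detect("jdbc:postgresql://localhost:5432/db"));
    }
}
```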
Flagged imports in SparkTableSinkTest.java:

```java
import org.apache.wayang.core.optimizer.OptimizationContext;
import org.apache.wayang.core.plan.wayangplan.OutputSlot;
import org.apache.wayang.core.platform.ChannelInstance;
import org.apache.wayang.core.types.DataSetType;
import org.apache.wayang.spark.channels.RddChannel;
import org.apache.wayang.spark.platform.SparkPlatform;
```

This test has unused imports (OptimizationContext, OutputSlot, SparkPlatform), which will fail the repository Checkstyle run (google_checks.xml includes UnusedImports). Please remove the unused imports (and any now-unneeded mocking setup) to keep the build green.

Suggested change:

```java
import org.apache.wayang.core.platform.ChannelInstance;
import org.apache.wayang.core.types.DataSetType;
import org.apache.wayang.spark.channels.RddChannel;
```
Flagged dependency in wayang-commons/wayang-basic/pom.xml:

```xml
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-core</artifactId>
  <version>${calcite.version}</version>
</dependency>
<dependency>
```

Adding calcite-core to wayang-basic just to access SqlDialect.DatabaseProduct significantly increases the dependency footprint of a low-level module and can introduce version-conflict pressure across the build. Consider replacing this with a small internal enum (or moving dialect-related utilities into a module that already depends on Calcite) to keep wayang-basic lightweight.

Suggested change: remove the calcite-core dependency block.
```java
if (recordRDD.isEmpty()) {
    return ExecutionOperator.modelEagerExecution(inputs, outputs, operatorContext);
}
Record first = (Record) recordRDD.first();
```

For Record inputs, this does two Spark actions (isEmpty() and then first()), which triggers two jobs and can be expensive. Prefer a single take(1) (or similar) to both detect emptiness and obtain a sample record for schema inference.

Suggested change:

```java
List<T> sample = recordRDD.take(1);
if (sample.isEmpty()) {
    return ExecutionOperator.modelEagerExecution(inputs, outputs, operatorContext);
}
Record first = (Record) sample.get(0);
```
```xml
<dependency>
  <groupId>org.postgresql</groupId>
  <artifactId>postgresql</artifactId>
  <version>42.7.2</version>
  <scope>test</scope>
</dependency>
```

org.postgresql:postgresql is added as a test dependency in the Java platform module, but the new tests use H2 and there is no usage of the PostgreSQL driver in this module's test sources. If it's not needed for other tests, removing it will reduce build time and dependency surface; otherwise, consider adding a test that actually covers PostgreSQL-specific behavior.

Suggested change: remove the PostgreSQL test dependency.
```java
public static List<SchemaField> getSchema(Class<?> cls, SqlDialect.DatabaseProduct product) {
    List<SchemaField> schema = new ArrayList<>();
    if (cls == Record.class) {
        // For Record.class without an instance, we can't derive names/types easily
        // Users should use the instance-based getSchema or provide columnNames
        return schema;
    }

    for (Field field : cls.getDeclaredFields()) {
        if (java.lang.reflect.Modifier.isStatic(field.getModifiers())) {
            continue;
        }
        schema.add(new SchemaField(field.getName(), field.getType(), getSqlType(field.getType(), product)));
    }
    return schema;
```

POJO schema derivation uses getDeclaredFields(), but value extraction uses ReflectionUtils.getProperty(...), which requires matching getX() getters. This will fail for private fields without getters (even though they are included in the schema) and can also include fields that should not be persisted. Consider deriving the schema from JavaBeans properties/getters (or at least filtering to fields that have corresponding getters).
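A getter-based derivation as suggested could look like this standalone sketch (class and property names are illustrative): only readable JavaBeans properties become schema entries, so every derived column is guaranteed to be extractable later.

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class BeanSchemaSketch {

    public static class Person {
        private String name;
        private int age;
        private String secret; // no getter -> excluded from the schema
        public String getName() { return name; }
        public int getAge() { return age; }
    }

    public static void main(String[] args) throws Exception {
        // Introspect bean properties, stopping before Object's methods.
        BeanInfo info = Introspector.getBeanInfo(Person.class, Object.class);
        List<String> columns = new ArrayList<>();
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (pd.getReadMethod() != null) { // keep only readable properties
                columns.add(pd.getName());
            }
        }
        columns.sort(String::compareTo); // deterministic column order
        System.out.println(columns);
    }
}
```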
```java
recordIterator.forEachRemaining(
    r -> {
        try {
            this.pushToStatement(ps, r, typeClass, finalColumnNames);
            ps.addBatch();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    });
```

recordIterator.forEachRemaining swallows SQLException via printStackTrace() and keeps building the batch, which can lead to partial or incorrect writes without failing the operator. Propagate the failure (e.g., wrap it in a runtime exception), then abort/rollback and surface it as a WayangException so the job fails deterministically.
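The wrap-and-propagate pattern suggested above can be shown in isolation (a sketch; `BatchException` and `bind` are illustrative stand-ins for the real wrapper and pushToStatement/addBatch):

```java
import java.sql.SQLException;
import java.util.Iterator;
import java.util.List;

public class PropagateSketch {

    // Unchecked wrapper so the checked SQLException can escape the lambda.
    static class BatchException extends RuntimeException {
        BatchException(Throwable cause) { super(cause); }
    }

    // Stands in for pushToStatement(...) + addBatch().
    static void bind(String row) throws SQLException {
        if (row.equals("boom")) throw new SQLException("bad row: " + row);
    }

    public static void main(String[] args) {
        Iterator<String> it = List.of("ok", "boom", "never-reached").iterator();
        try {
            it.forEachRemaining(r -> {
                try {
                    bind(r);
                } catch (SQLException e) {
                    throw new BatchException(e); // abort instead of printStackTrace()
                }
            });
        } catch (BatchException e) {
            // The operator would rollback here and rethrow as WayangException.
            System.out.println("aborted: " + e.getCause().getMessage());
        }
    }
}
```

Note that iteration stops at the failing row, so no further rows are added to a batch that will never commit.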
```java
if (typeClass == Record.class) {
    Record r = (Record) element;
    for (int i = 0; i < columnNames.length; i++) {
        setRecordValue(ps, i + 1, r.getField(i));
    }
} else {
```

For Record inputs, the code assumes columnNames.length <= record.size() and indexes fields by position. If columnNames is longer than the Record, r.getField(i) will throw at runtime. Please validate the lengths early (and give a clear error) or derive columnNames from the record size when not provided.
```java
for (Field field : cls.getDeclaredFields()) {
    if (java.lang.reflect.Modifier.isStatic(field.getModifiers())) {
        continue;
    }
    schema.add(new SchemaField(field.getName(), field.getType(), getSqlType(field.getType(), product)));
}
return schema;
```

Class#getDeclaredFields() does not guarantee a stable order across JVMs/compilers, but the derived schema order determines column ordering (and tests assume a specific order). To avoid nondeterministic schemas, sort fields deterministically (e.g., by name) or use a stable property-introspection approach.
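The deterministic-ordering fix can be as small as sorting the reflected fields before deriving columns; a standalone sketch (class name illustrative):

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.Comparator;

public class StableFieldOrder {

    static class Row { int b; int a; int c; }

    public static void main(String[] args) {
        // getDeclaredFields() order is unspecified by the JLS, so sort by
        // name to get the same column order on every JVM.
        Field[] fields = Row.class.getDeclaredFields();
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        StringBuilder cols = new StringBuilder();
        for (Field f : fields) cols.append(f.getName());
        System.out.println(cols);
    }
}
```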
```java
// Update column names in the operator if they were generated
String[] newColNames = schemaFields.stream().map(SqlTypeUtils.SchemaField::getName).toArray(String[]::new);
this.setColumnNames(newColNames);
```

this.setColumnNames(...) mutates the operator based on runtime data (the first record). Combined with TableSink copying columnNames by reference, this can leak schema decisions across copies/reuses. Prefer keeping inferred column names local to the evaluation (or ensure defensive copies in TableSink and avoid mutating shared state).

Suggested change: remove these lines.
```java
// If columnNames are provided, we should probably select/rename them,
// but usually createDataFrame(rdd, beanClass) maps fields to columns.
if (this.getColumnNames() != null && this.getColumnNames().length > 0) {
    // Optionally filter or reorder columns to match this.getColumnNames()
    // For now, Spark's native mapping is preferred.
```

In the POJO branch, columnNames passed to TableSink are silently ignored (the code comments mention it, but no behavior is applied). This makes the API misleading, because callers might expect renaming, reordering, or subset selection. Either implement the mapping (select/rename) or validate and reject columnNames for POJO inputs with a clear error.

Suggested change:

```java
// For POJOs, we currently do not support custom columnNames to avoid
// ambiguous or misleading mappings. Fail fast if they are provided.
String[] columnNames = this.getColumnNames();
if (columnNames != null && columnNames.length > 0) {
    throw new WayangException(
        "columnNames are not supported for POJO inputs in SparkTableSink. " +
        "Either omit columnNames or use Record inputs if you need custom column mapping.");
```
Summary
This PR introduces a new TableSink operator for writing Record data into a database table via JDBC, with implementations for the Java and Spark platforms. Opening as Draft to start discussion on the operator design and expected behavior.

Changes
New operator: TableSink (in wayang-basic)
- UnarySink<Record> that targets a table name and accepts JDBC connection Properties
- mode (e.g. overwrite) and optional column names

Java platform: JavaTableSink (in wayang-java)
- overwrite by dropping the target table first

Spark platform: SparkTableSink (in wayang-spark)
- implements the TableSink operator

Notes / open questions
- … VARCHARs)
- mode behavior (overwrite vs append, etc.) should be agreed on and formalized.

How to use / test
To run end-to-end locally, you currently need an external PostgreSQL instance available and to provide JDBC connection details (driver/url/user/password) in the test setup/environment.
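The connection details might be wired up via plain java.util.Properties, e.g. as below. The keys and values here are illustrative placeholders matching the driver/url/user/password list above, not the actual test configuration:

```java
import java.util.Properties;

public class SinkConfigSketch {
    public static void main(String[] args) {
        // Hypothetical JDBC settings a local end-to-end run would need.
        Properties props = new Properties();
        props.setProperty("driver", "org.postgresql.Driver");
        props.setProperty("url", "jdbc:postgresql://localhost:5432/wayang_test");
        props.setProperty("user", "wayang");
        props.setProperty("password", "secret");
        System.out.println(props.getProperty("url"));
    }
}
```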